libva-multiplanar/phase2_iter6_situation.md

# Phase 2 (iter6) — situation analysis: Firefox H.264 EINVAL

Opened 2026-05-05 immediately after iter6 Phase 1 lock on candidate I.

## Working hypothesis

The iter5-amend Firefox 150 binary (Utility seccomp fix shipped) loads the v4l2_request driver, sets up surfaces (`cap_pool_init: 24 slots ready`), then fails on either:
- `VIDIOC_S_EXT_CTRLS` EINVAL — observed on YouTube avc1 (Enhancer for YouTube forcing h264, multiple decode attempts)
- `VIDIOC_QBUF` EINVAL — observed on `bbb_1080p30_h264.mp4` direct file URL (single decode attempt)

mpv `--hwdec=vaapi-copy` decoded 2000 frames clean on the same iter5-end driver build. So the failure is consumer-specific (Firefox fails, mpv does not), and which ioctl fails is stream-specific (S_EXT_CTRLS on YouTube avc1, QBUF on the bbb file).

## Diagnostic build (iter6-DX)

The source was tarballed from the campaign fork at HEAD `c8b6ede` and pushed to ohm at `/home/mfritsche/iter6-fork-dx/`. Two diagnostics were added (local-only, not committed):

1. `v4l2_set_controls`: per-control `VIDIOC_TRY_EXT_CTRLS` isolation when S_EXT_CTRLS returns EINVAL. It tries each compound control alone, which sidesteps the kernel cluster-commit obfuscation described in `feedback_kernel_obfuscation_compound.md`. This reinstates the iter4 diagnostic (`f21bdf0`) that the iter5 sweep removed (`848fc0c`).

2. `v4l2_queue_buffer`: full `v4l2_buffer` struct dump on QBUF EINVAL (type, memory, index, length, bytesused, timestamp, flags, request_fd), to show which field the kernel objects to.

Built on ohm with meson/ninja. Output sha `e01e2cc05b925cfdb2ad8e79a59ab796c90b9b20cdc4863a7c7f9c64624ed5a2`. Installed at `/usr/lib/dri/v4l2_request_drv_video.so`. Iter5-end driver backed up at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`.

## Telemetry runs

### Run 1 — bbb_1080p30_h264.mp4 (with iter6-DX driver, S_EXT_CTRLS diagnostic only)

```
v4l2-request: cap_pool_init: 24 slots ready (v4l2_index=0..23, 1 plane(s) per slot)
v4l2-request: Unable to queue buffer: Invalid argument
```

**Observation:** No ITER6_DX TRY-isolation lines fired → S_EXT_CTRLS is NOT failing on bbb. The QBUF EINVAL is the sole error. Single attempt, then Firefox falls back to SW.

This means the diagnostic placement was incomplete for this stream. The QBUF-side diagnostic (just authored in this Phase 2) is needed. After rebuild + re-test, we'll have the full `v4l2_buffer` payload at the failing call.

### Run 2 — YouTube avc1 (planned)

YouTube URL: `https://www.youtube.com/watch?v=7DAPd5MGodY` with Enhancer for YouTube extension forcing h264 codec.

Prior iter5-amend run (without diagnostic): emitted both `Unable to set control(s)` and `Unable to queue buffer` errors across 4 cap_pool_init events (multiple decode attempts on one tab as the player retried).

Run 2 with iter6-DX driver will print the failing compound H.264 control's id + size (and which others passed). Will identify whether the issue is in SPS, PPS, SLICE_PARAMS, DECODE_PARAMS, or SCALING_MATRIX.

## Why-but-not-mpv reasoning

mpv `--hwdec=vaapi-copy` uses its own libva client code, derived from FFmpeg's reference implementation. Firefox's Linux media stack also reaches libva through the FFmpeg path, but its surface management differs: Firefox exports each surface as a DMA-buf handle (`vaExportSurfaceHandle`) for zero-copy, while mpv's vaapi-copy mode reads frames back through a VAImage (per the `-copy` semantics).

Possible divergences that lead to S_EXT_CTRLS / QBUF EINVAL:
- DMA-buf attach changes the buffer's exported state — kernel may reject QBUF on a buffer that has been EXPBUF'd
- VABufferType ordering — Firefox may submit slice data buffers before SPS/PPS for some frames
- Firefox's SPS payload builder might include a field mpv doesn't (e.g. profile_idc edge case, separate_colour_plane_flag)
- request_fd lifecycle: iter4 fixed mpv's per-frame request_fd allocation; Firefox may share request_fd across frames in a way the kernel rejects

The diagnostic output should narrow this in a single retest.

## Next steps (Phase 2 → Phase 3 → Phase 4)

1. Add QBUF-side diagnostic to ohm's `/home/mfritsche/iter6-fork-dx/src/v4l2.c` (waiting on ohm-tools MCP reconnection).
2. Rebuild + reinstall iter6-DX driver on ohm.
3. Re-run bbb fixture: capture full v4l2_buffer payload at QBUF failure.
4. Re-run YouTube avc1: capture per-control TRY isolation output identifying failing compound H.264 control.
5. Document findings as Phase 2 evidence.
6. Phase 3: anchor pre-fix baseline (iter5-end driver behavior + iter5-amend Firefox + iter6-DX driver behavior) for before/after comparison.
7. Phase 4: design + implement the actual semantic fix in libva-side struct setup (per `feedback_no_fixture_hardcoding.md`: must be a general-case fix, not a stream-specific kludge).

## Stop point

Waiting on ohm-tools MCP reconnect. After that, autonomous through Phase 4 plan authoring (sonnet review at Phase 5 is mandatory per CLAUDE.md).

## Phase 2 finding update (post-payload-comparison telemetry, 2026-05-05)

### Working symptom shifted

Initial assumed bug: "Firefox VIDIOC_S_EXT_CTRLS EINVAL on frame 1, all controls fail individually with cluster-validation error_idx=count=1, looks like iter4 request_fd-staleness pattern reappearing in a different consumer."

Actual current bug (after payload comparison + multiple test cycles):

| Phase | mpv-vaapi-copy | Firefox |
|---|---|---|
| Device-init `S_EXT_CTRLS` (DECODE_MODE/START_CODE, request_fd=-1) | rc=0 | rc=0 |
| Per-frame `S_EXT_CTRLS` (4 controls, request_fd≥0) | rc=0 sustained 2000+ frames | rc=0 for 19 frames |
| Per-frame `QBUF` (OUTPUT, request_fd) | rc=0 sustained 2000+ frames | **EINVAL on frame 20** |

Failing QBUF detail (steady-state): `video_fd=32 request_fd=30 type=10 memory=1 index=1 length=1 bytesused(plane0)=66661 flags=0x800000`. Here type=10 is `V4L2_BUF_TYPE_VIDEO_OUTPUT_MPLANE`, memory=1 is `V4L2_MEMORY_MMAP`, length=1 is the plane count, and flags=0x800000 is `V4L2_BUF_FLAG_REQUEST_FD`. The original "first-frame fail" pattern observed before this Phase 2 work seems to have been transient kernel/Firefox state; it does not reproduce on fresh runs with the iter6-DX diagnostic build.

### Payload comparison (mpv vs Firefox, frame 1)

First 64 bytes of every per-frame ext_control are **byte-identical** between mpv (works) and Firefox (works through frame 19, fails at QBUF on frame 20):
```
SPS:             4d 00 29 00 01 00 00 01 00 03 02 00 00 00 ...   (Main profile, level 4.1)
PPS:             00 00 00 00 00 00 02 00 00 00 8a 00
DECODE_PARAMS:   00 00 00 00 ... (zeroed DPB at first frame)
SCALING_MATRIX:  10 10 10 10 ... (default flat 0x10 = 16)
```
So the *control payloads* aren't where mpv and Firefox diverge.
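The comparison itself is a byte-prefix diff. The sketch below uses hypothetical helper names (`first_diff`, `decode_sps_head`) and assumes the SPS control payload begins with `profile_idc`, `constraint_set_flags`, `level_idc` in that order, as in the kernel's `struct v4l2_ctrl_h264_sps`.

```c
#include <stddef.h>
#include <stdint.h>

/* Index of the first differing byte within the first n bytes,
 * or -1 when the two prefixes are identical. */
long first_diff(const uint8_t *a, const uint8_t *b, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (a[i] != b[i])
            return (long)i;
    }
    return -1;
}

/* Leading SPS fields, under the layout assumption stated above. */
struct sps_head {
    unsigned int profile_idc;
    unsigned int constraint_set_flags;
    unsigned int level_idc;
};

struct sps_head decode_sps_head(const uint8_t *payload)
{
    struct sps_head head;

    head.profile_idc = payload[0];
    head.constraint_set_flags = payload[1];
    head.level_idc = payload[2];
    return head;
}
```

For the SPS bytes `4d 00 29` this gives profile_idc 77 (Main) and level_idc 41 (level 4.1), which matches the annotation in the dump. Note that the diagnostic only covers the first 64 bytes of each control, so a divergence beyond that offset would not be visible in this comparison.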

### New hypothesis (cap_pool / source-index recycling)

mpv-vaapi-copy uses single-surface recycling — always one OUTPUT buffer index recycled in place. Firefox rotates through multiple surfaces → driver issues OUTPUT QBUF on `source_index=0, 1, 2, ...` over consecutive frames. By frame 20, some `source_index` is being requeued before the prior submission has been DQBUF'd by RequestSyncSurface, OR the buffer pool is exhausted, OR the kernel-side OUTPUT buffer state machine is rejecting recycle.

This is structurally similar to iter5 Phase 5 sonnet caveat C4 (cap_pool resolution-change race) — buffer-pool lifecycle issue. Possibly a different facet of the same root weakness: cap_pool / OUTPUT buffer drain isn't sequenced cleanly when the consumer issues fast-rotating surface IDs.

### Diagnostic build status (iter6-DX)

- Source tree: `/home/mfritsche/iter6-fork-dx/` on ohm
- Local-only patches in `src/v4l2.c`:
  - per-control TRY isolation in `v4l2_set_controls` failure path
  - per-call S_EXT_CTRLS rc + first-control id/value log
  - first-perframe-call payload hex dump (one-shot)
  - QBUF failure full v4l2_buffer struct dump
- Built sha (current): `63ea0e630748cc81e98c3d3108fa5f593f60f880fa6a18a28fbafb6864740648`
- Installed at `/usr/lib/dri/v4l2_request_drv_video.so`
- Iter5-end backup at `/home/mfritsche/v4l2_request_drv_video.so.iter5end.bak`

These patches are logging-only (no semantic change). They will not be committed to the campaign repo.

### Next steps

The OUTPUT-QBUF-on-frame-20 finding warrants:
1. Cross-check whether our driver allocates enough OUTPUT buffers: compare the `REQBUFS` count against the surface rotation pattern.
2. Check whether we DQBUF OUTPUT buffers correctly between submissions: review the picture.c QBUF path and the surface.c RequestSyncSurface DQBUF path.
3. Strace around the frame 19-20 boundary to see the exact kernel ioctl sequence preceding the EINVAL.
4. Determine whether iter5 Phase 5 caveat C4 (cap_pool race) and this iter6-I finding share a root cause; they might be the same bug under different names.

User check-in needed: this is structurally similar to iter5 carryover candidate A (cap_pool resolution-change race). Probably worth merging iter6-I scope with candidate A or treating them as one investigation.

Operator decision (2026-05-05): merge accepted. Scope is now A∪I — the merged buffer-pool / surface-recycle lifecycle.

## Phase 2 wrap (2026-05-05) — bug is intermittent, request_fd / OUTPUT-buffer race

After three test runs with the iter6-DX diagnostic build:

| Run | Successful frames | Failure point |
|---|---|---|
| 1 | 1 | OUTPUT QBUF EINVAL index=3 |
| 2 | 19 | OUTPUT QBUF EINVAL index=1 |
| 3 | 53 | OUTPUT QBUF EINVAL index=0 |

Failure is **intermittent**: the frame number and the OUTPUT pool slot index differ between runs. Constant across all three failures: request_fd=30, type=OUTPUT_MPLANE, REQUEST_FD flag set. bytesused differs each time, but only because it tracks the frame's slice payload size. mpv-vaapi-copy never hits it (verified clean ≥3 frames in this Phase 2 testing; iter5 verified 2000 frames clean).

### Theories ruled out

- **DQBUF returns wrong index** — diagnostic logs `expected vs kernel-returned` and fired zero mismatches across 53 successful + 1 failed cycle. Kernel returns the same index our code passes.
- **Control payload divergence** — first 64 bytes of every per-frame ext_control byte-identical between mpv (works) and Firefox (works frame N, fails frame N+1).
- **Frame-1 setup failure** — original symptom does not reproduce on iter6-DX. S_EXT_CTRLS device-init call (DECODE_MODE/START_CODE) succeeds for both consumers.

### Surviving hypothesis

**request_fd lifecycle race.** request_fd=30 is reused on every frame (Linux fd-allocation reuses lowest-free after close). The iter4 fix (`385dee1`) closes request_fd in RequestSyncSurface after DQBUF. But:

1. If close() does not synchronously release the kernel-side request object (it may defer cleanup until the request is fully drained from any in-flight queue), the next `MEDIA_IOC_REQUEST_ALLOC` may return fd=30 pointing at a *new* request object while the OLD object is still mid-cleanup with its OUTPUT buffer attached.
2. Then frame N+1's QBUF(OUTPUT, request_fd=30) attempts to attach a new buffer to a request that already has one — kernel rejects with EINVAL.

Why mpv doesn't hit this: single-surface recycle means each frame's `RequestSyncSurface → close(request_fd) → DQBUF → next BeginPicture` cycle fully completes before the next frame's BeginPicture. Firefox's MediaSource pipeline issues multiple BeginPicture in flight (separate libva surface IDs), so a stale request_fd from a not-yet-fully-closed previous frame can collide with a new frame's allocation.
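The fd-number reuse that makes request_fd=30 appear on every frame is ordinary POSIX behaviour and can be shown without any media device. A minimal sketch, using `/dev/null` in place of a request fd (`fd_number_is_reused` is a name made up for this example):

```c
#include <fcntl.h>
#include <unistd.h>

/* Open, close, open again. Returns 1 when the second open gets the same
 * fd number as the first, 0 when it does not, -1 on open failure. */
int fd_number_is_reused(void)
{
    int first, second;

    first = open("/dev/null", O_RDONLY);
    if (first < 0)
        return -1;
    close(first);

    second = open("/dev/null", O_RDONLY);
    if (second < 0)
        return -1;
    close(second);

    return first == second;
}
```

In a single-threaded process this returns 1, because each new descriptor takes the lowest free number. It demonstrates only the number reuse; whether the kernel-side request object behind the old number is still being torn down is the open question of the hypothesis and is not something this sketch can show.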

### Phase 4 fix sketch

Three angles, leading candidate is **C**:

A. **Use unique request fds, not reused-30.** After `media_request_alloc`, `dup()` the fd to push it past low-fd-reuse range, OR maintain a pool of pre-allocated request fds reset via REINIT. iter4 chose close+alloc over REINIT because REINIT failed for the iter1-3 frame-11 case — but that may have been the DPB / control-payload bug, not this lifecycle race.

B. **Serialize close+alloc with explicit barrier.** After `close(request_fd)`, sleep / probe before next `MEDIA_IOC_REQUEST_ALLOC`. Brittle.

C. **Drain OUTPUT pool slot before surface re-bind (mirror cap_pool discipline)**. picture.c:245 unbinds CAPTURE slot if `surface_object->current_slot != NULL`. Add equivalent for OUTPUT: at start of BeginPicture, if the surface still owns an OUTPUT slot from a previous decode that wasn't sync'd yet, drive that slot to drain (close request_fd, DQBUF its buffer, request_pool_release) before acquiring a new slot. Mirrors iter4's "drain before reuse" discipline at the right layer.

C is the cleanest. The OUTPUT-side gap mirrors the iter4 work for request_fd; this extends the same discipline to the OUTPUT pool slot.

### Stop point (revised)

Phase 2 has identified the bug class: **OUTPUT-pool / request_fd race, surfaced by Firefox's multi-surface concurrent-BeginPicture pattern that mpv-vaapi-copy doesn't hit**.

Phase 4 leading approach: **C — extend iter4's "drain before reuse" discipline from request_fd to OUTPUT pool slot**, with the same close-then-fresh-alloc pattern applied to the OUTPUT slot lifecycle when surface state forces reuse.

Phase 5 sonnet review will be mandatory before any commit (per CLAUDE.md user-global rule).

## Phase 5 (front-loaded design review, 2026-05-05)

Sonnet review surfaced two important caveats:
1. **Pool exhaustion as competing hypothesis** — should test by bumping `request_pool_init` from 4 to 16 *before* implementing approach C.
2. **Approach C is mostly dead code in normal paths** — `EndPicture` (picture.c:389) calls `RequestSyncSurface` inline, which clears `source_data=NULL`. Drain-before-rebind only matters for error-path recovery.

Run that diagnostic now.

### Pool-size bump diagnostic (2026-05-05)

Changed `context.c:99-100` `request_pool_init(..., 4)` → `..., 16` in iter6-DX. Rebuilt (sha `e7ed864d693449a512c66d1c547b567ce6336d087c421d645eef7d7a7f248765`). Retested Firefox bbb fixture.

| Pool size | Successful frames | Failure slot | Notes |
|---|---|---|---|
| 4 (run 1) | 1 | index=3 | Original |
| 4 (run 2) | 19 | index=1 | |
| 4 (run 3) | 53 | index=0 | |
| 16 | 62 | index=13 | Slot 13 → ~4th reuse of slot |

**Conclusion:** a larger pool is not the dominant fix. Bumping 4→16 moved the failure from 53 successful frames (the best pool=4 run) to 62, a marginal gain. The target is 30 fps × 30 s = 900 frames, and pool=24 won't close that gap. The failure mechanism is real per-slot kernel-side state contention, not pool exhaustion. (Pool exhaustion would manifest as `request_pool_acquire` returning -1 and BeginPicture cleanly failing with `ALLOCATION_FAILED`, not QBUF EINVAL on a freshly-acquired slot.)
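The distinction between exhaustion and the observed failure can be made concrete with a toy pool. This is a minimal model with hypothetical names (`slot_pool`, `pool_acquire`, `pool_release`), not the driver's `request_pool` implementation; it only shows that exhaustion is visible at acquire time, before any QBUF is issued.

```c
#include <stdbool.h>

#define POOL_MAX 16

struct slot_pool {
    bool busy[POOL_MAX];
    int size;
};

void pool_init(struct slot_pool *pool, int size)
{
    pool->size = size > POOL_MAX ? POOL_MAX : size;
    for (int i = 0; i < POOL_MAX; i++)
        pool->busy[i] = false;
}

/* Returns the lowest free slot index, or -1 when every slot is busy. */
int pool_acquire(struct slot_pool *pool)
{
    for (int i = 0; i < pool->size; i++) {
        if (!pool->busy[i]) {
            pool->busy[i] = true;
            return i;
        }
    }
    return -1;
}

void pool_release(struct slot_pool *pool, int index)
{
    if (index >= 0 && index < pool->size)
        pool->busy[index] = false;
}
```

With a pool of 4, a fifth acquire without an intervening release returns -1. In the Firefox runs the acquire succeeds and the EINVAL comes later from the kernel, which is why exhaustion does not fit the evidence.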

The race fires on what appears to be the **~4th reuse of a slot** under Firefox's surface-rotation rate. mpv-vaapi-copy reuses slot 0 forever and never hits it because the per-buffer kernel state machine cycles cleanly when each cycle fully drains before the next begins. Firefox's pipeline issues per-surface BeginPicture/EndPicture, but `EndPicture` calls `RequestSyncSurface` inline (picture.c:389), so on its face the per-surface cycle should also be synchronous.

This means the race is happening **between SUCCESSIVE surface-decode cycles**, not within a single one. Specifically: surface A's `RequestSyncSurface(A)` does `close(request_fd_A)` + `DQBUF OUTPUT[slot_A]` — both kernel-clean operations from userspace's perspective. Surface B's next `BeginPicture(B)` then does `media_request_alloc()` (returns a fresh kernel request object, possibly at fd=30 again due to lowest-free-fd reuse) + `request_pool_acquire()` (returns slot_B which may have been used by an even-earlier surface decode). QBUF on slot_B with the new request_fd: **EINVAL on roughly the 4th reuse of any given slot.**

### Revised hypothesis

The kernel-side V4L2 OUTPUT buffer state machine retains a residual reference to the *last* request object that touched it, even after our DQBUF returns success. Reusing the buffer with a NEW request object (different request_fd) within ~3-4 frames of the prior use fails intermittently because the kernel's buffer-to-request linkage hasn't been fully torn down.

This is hantro-driver kernel-side state. Userspace mitigations:
- **Increase the buffer cycle distance** (pool size) — papers over but doesn't eliminate
- **Hold each request_fd for the lifetime of its OUTPUT slot** (one-to-one binding). This eliminates the cross-pollination but requires either REINIT (iter4 ruled out for compound-control reasons) or some other way to reset the request object without close+alloc
- **Sleep / probe-ioctl barrier** between close-request_fd and next QBUF on the same slot — gross but effective
- **Detect the EINVAL on QBUF and retry once after barrier** — error-path recovery, ugly
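The retry-once mitigation from the last bullet can be sketched as a small wrapper. All names here are hypothetical (`queue_with_one_retry`, `queue_fn`), the barrier is modelled as a plain `nanosleep`, and `fake_queue_op` is a stand-in for the real QBUF call so the sketch runs without hardware.

```c
#include <errno.h>
#include <time.h>

/* Queue operation: returns 0 on success, -1 with errno set on failure. */
typedef int (*queue_fn)(void *ctx);

/* Call queue(); on EINVAL wait barrier_ns (must be below one second)
 * and try exactly once more. Other errors are returned immediately. */
int queue_with_one_retry(queue_fn queue, void *ctx, long barrier_ns)
{
    struct timespec pause = { .tv_sec = 0, .tv_nsec = barrier_ns };

    if (queue(ctx) == 0)
        return 0;
    if (errno != EINVAL)
        return -1;

    nanosleep(&pause, NULL);
    return queue(ctx);
}

/* Stand-in for QBUF: fails with EINVAL a fixed number of times. */
struct fake_queue {
    int failures_left;
    int calls;
};

int fake_queue_op(void *ctx)
{
    struct fake_queue *fake = ctx;

    fake->calls++;
    if (fake->failures_left > 0) {
        fake->failures_left--;
        errno = EINVAL;
        return -1;
    }
    return 0;
}
```

Limiting the retry to one attempt keeps a genuine EINVAL (a malformed buffer) from turning into a stall. The right barrier length is unknown at this point and would have to be found by experiment.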

### Stop point (Phase 2 closes here)

Phase 2 has narrowed the bug class to: **kernel-side per-buffer state contention with the V4L2 request object lifecycle on hantro G1, surfaced when consumers rotate OUTPUT slots faster than the kernel cleans up the buffer-to-request linkage.** Pool size only mitigates marginally.

**Operator decision needed before Phase 4:**

- **Direction 1**: deeper kernel-side analysis — read hantro_drv.c / v4l2-mem2mem / videobuf2 source for the OUTPUT-buffer-to-request state machine, identify the exact cleanup ordering, design a userspace barrier that aligns with it. This may also surface upstream kernel-bug avenues.

- **Direction 2**: pragmatic mitigation — combine pool size = `surfaces_count + DPB_MAX` with QBUF-EINVAL-retry-with-barrier. Will likely get us past 900-frame target but isn't architecturally clean.

- **Direction 3**: investigate REINIT-based per-slot request_fd binding. iter4 ruled this out for the iter1-3 frame-11 case (control-payload-related); this is a different failure mode and REINIT may now work. Worth a 30-min experiment.

Direction 3 is the cleanest if it works. Direction 1 is most thorough. Direction 2 is fastest.
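For the Direction 3 experiment, the core call is `MEDIA_REQUEST_IOC_REINIT` on a request fd that stays open across frames. A minimal sketch, assuming each OUTPUT slot holds one request fd obtained once via `MEDIA_IOC_REQUEST_ALLOC`; the wrapper name `request_reinit` is hypothetical:

```c
#include <errno.h>
#include <linux/media.h>
#include <sys/ioctl.h>

/* Reset a long-lived request fd so it can carry the next frame.
 * Returns 0 on success or a negative errno value on failure. */
int request_reinit(int request_fd)
{
    if (ioctl(request_fd, MEDIA_REQUEST_IOC_REINIT) < 0)
        return -errno;
    return 0;
}
```

Per the kernel's media request API documentation, REINIT is only valid when the request has not been queued or has already completed; otherwise it fails with EBUSY. The wrapper would therefore belong after the slot's DQBUF in RequestSyncSurface, and an EBUSY result there would itself be evidence about the kernel-side teardown ordering.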