iter6 Phase 2 final update: pool-bump diagnostic + REINIT pivot

Captures the post-Phase-5-design-review experiment: - pool=4 -> 16 only delayed failure (1/19/53 -> 62 frames), refuting pool exhaustion as the dominant cause - Surviving hypothesis: kernel-side buffer-state contention with fd-reuse on close+alloc cycle, surfaced by Firefox's multi- surface MediaSource pattern - Three Phase 4 directions weighed; operator chose 3 (REINIT) with 1 (kernel analysis) as fallback if 3 failed This file now serves as the iter6 Phase 2 reference for future iterations — the "Phase 5 (front-loaded design review)" section documents the back-and-forth with sonnet that shaped the final fix shape. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:45:01 +00:00
parent 104abd1624
commit 4cd0167d08
1 changed files with 49 additions and 0 deletions
@@ -172,3 +172,52 @@ Phase 2 has identified the bug class: **OUTPUT-pool / request_fd race, surfaced
 Phase 4 leading approach: **C — extend iter4's "drain before reuse" discipline from request_fd to OUTPUT pool slot**, with the same close-then-fresh-alloc pattern applied to the OUTPUT slot lifecycle when surface state forces reuse.

 Phase 5 sonnet review will be mandatory before any commit (per CLAUDE.md user-global rule).
+
+## Phase 5 (front-loaded design review, 2026-05-05)
+
+Sonnet review surfaced two important caveats:
+1. **Pool exhaustion as competing hypothesis** — should test by bumping `request_pool_init` from 4 to 16 *before* implementing approach C.
+2. **Approach C is mostly dead code in normal paths** — `EndPicture` (picture.c:389) calls `RequestSyncSurface` inline, which clears `source_data=NULL`. Drain-before-rebind only matters for error-path recovery.
+
+Run that diagnostic now.
+
+### Pool-size bump diagnostic (2026-05-05)
+
+Changed `context.c:99-100` `request_pool_init(..., 4)` → `..., 16` in iter6-DX. Rebuilt (sha `e7ed864d693449a512c66d1c547b567ce6336d087c421d645eef7d7a7f248765`). Retested Firefox bbb fixture.
+
+| Pool size | Successful frames | Failure slot | Notes |
+|---|---|---|---|
+| 4 (run 1) | 1 | index=3 | Original |
+| 4 (run 2) | 19 | index=1 | |
+| 4 (run 3) | 53 | index=0 | |
+| 16 | 62 | index=13 | Slot 13 → ~4th reuse of slot |
+
+**Conclusion:** pool size is not the dominant fix. Bumping 4→16 pushed failure from frame 53 to 62 (marginal). 30fps × 30s = 900 frames target; pool=24 won't close that gap. Failure mechanism is real per-slot kernel-side state contention, not pool exhaustion. (Pool exhaustion would manifest as `request_pool_acquire` returning -1 and BeginPicture cleanly failing with `ALLOCATION_FAILED`, not QBUF EINVAL on a freshly-acquired slot.)
+
+The race fires on what appears to be the **~4th reuse of a slot** under Firefox's surface-rotation rate. mpv-vaapi-copy reuses slot 0 forever and never hits it because the per-buffer kernel state machine cycles cleanly when each cycle fully drains before the next begins. Firefox's pipeline issues per-surface BeginPicture/EndPicture, but `EndPicture` calls `RequestSyncSurface` inline (picture.c:389), so on its face the per-surface cycle should also be synchronous.
+
+This means the race is happening **between SUCCESSIVE surface-decode cycles**, not within a single one. Specifically: surface A's `RequestSyncSurface(A)` does `close(request_fd_A)` + `DQBUF OUTPUT[slot_A]` — both kernel-clean operations from userspace's perspective. Surface B's next `BeginPicture(B)` then does `media_request_alloc()` (returns a fresh kernel request object, possibly at fd=30 again due to lowest-free-fd reuse) + `request_pool_acquire()` (returns slot_B which may have been used by an even-earlier surface decode). QBUF on slot_B with the new request_fd: **EINVAL on roughly the 4th reuse of any given slot.**
+
+### Revised hypothesis
+
+The kernel-side V4L2 OUTPUT buffer state machine retains a residual reference to the *last* request object that touched it, even after our DQBUF returns success. Reusing the buffer with a NEW request object (different request_fd) within ~3-4 frames of the prior use racy-fails because the kernel's buffer-to-request linkage hasn't been fully torn down.
+
+This is hantro-driver kernel-side state. Userspace mitigations:
+- **Increase the buffer cycle distance** (pool size) — papers over but doesn't eliminate
+- **Hold each request_fd for the lifetime of its OUTPUT slot** (one-to-one binding) — eliminates the cross-pollination but requires either REINIT (iter4 ruled out for compound-control reasons) or some way to "reset" the request_object without close+alloc
+- **Sleep / probe-ioctl barrier** between close-request_fd and next QBUF on the same slot — gross but effective
+- **Detect the EINVAL on QBUF and retry once after barrier** — error-path recovery, ugly
+
+### Stop point (Phase 2 closes here)
+
+Phase 2 has narrowed the bug class to: **kernel-side per-buffer state contention with the V4L2 request object lifecycle on hantro G1, surfaced when consumers rotate OUTPUT slots faster than the kernel cleans up the buffer-to-request linkage.** Pool size only mitigates marginally.
+
+**Operator decision needed before Phase 4:**
+
+- **Direction 1**: deeper kernel-side analysis — read hantro_drv.c / v4l2-mem2mem / videobuf2 source for the OUTPUT-buffer-to-request state machine, identify the exact cleanup ordering, design a userspace barrier that aligns with it. This may also surface upstream kernel-bug avenues.
+
+- **Direction 2**: pragmatic mitigation — combine pool size = `surfaces_count + DPB_MAX` with QBUF-EINVAL-retry-with-barrier. Will likely get us past 900-frame target but isn't architecturally clean.
+
+- **Direction 3**: investigate REINIT-based per-slot request_fd binding. iter4 ruled this out for the iter1-3 frame-11 case (control-payload-related); this is a different failure mode and REINIT may now work. Worth a 30-min experiment.
+
+Direction 3 is the cleanest if it works. Direction 1 is most thorough. Direction 2 is fastest.