iter7: A+B+C — slot-leak fix, cap_pool harness, msync verify harness

Closes three internal carry items in one fork commit. iter6 deferred these as TODOs; iter7 lands the implementations + supporting tests. # Track B — slot-leak error recovery (src/) iter6 documented the RequestSyncSurface error paths as a "bounded leak we accept" — slots stayed busy=true after REINIT/DQBUF failures until RequestTerminate ran. With pool=16 and rare errors this was acceptable, but a sustained-error scenario could starve the pool. Adds request_pool_force_release(pool, index) which: 1. Tries media_request_reinit on the slot's fd (cheap path) 2. Falls back to close + media_request_alloc (recovery) 3. Leaves the slot dead-busy if even alloc fails (other slots unaffected, pool capacity reduced by 1 until destroy) Wires it into surface.c RequestSyncSurface error paths only for errors before the OUTPUT-DQBUF attempt. After OUTPUT-DQBUF failure the V4L2 buffer is in indeterminate kernel state, so a separate error label (`error_buffer_indeterminate`) leaves the slot dead-busy — reusing the slot would QBUF on a kernel-still-held buffer and EINVAL. Phase 5 sonnet review caught this discriminator subtlety pre-commit. Files: request_pool.{h,c}, surface.c. # Track C — cap_pool race synthetic harness (tests/) iter5 sonnet C4 / iter6 candidate A: cap_pool resolution-change race was organically exercised by YT's quality renegotiations (iter6 close, 4 cap_pool_init events clean) but had no deterministic regression test. tests/cap_pool_probe_pattern.c — ~170-line C program: opens libva display, vaCreateConfig, vaCreateSurfaces(small) + vaCreateContext (triggers OUTPUT pool init at small resolution), dispose, vaCreateSurfaces(big) + vaCreateContext (forces S_FMT on the new resolution against an in-use OUTPUT pool — the actual race-hitting path). Phase 5 sonnet flagged that without vaCreateContext the test would pass trivially (OUTPUT pool never init'd, REQBUFS(0) on empty queue is a no-op). Fixed before commit. tests/run_cap_pool_probe.sh — runner; greps driver stderr for REQBUFS / EBUSY / "Unable to set format" race indicators. # Track A — msync pixel-correctness verify harness (tests/) iter5 sweep removed msync(MS_SYNC|MS_INVALIDATE) from CAPTURE DQBUF path. iter5 sonnet C3 flagged: no formal pixel verification. tests/run_msync_pixel_verify.sh — runs FFmpeg SW decode (libavcodec reference) and FFmpeg HW decode (via our v4l2_request driver), compares NV12 byte streams. Probes fixture dimensions via ffprobe and uses crop=$W:$H after hwdownload to normalize MB-padding artifacts (hantro pads height to 16-line align; SW returns crop-aligned). Phase 5 sonnet flagged the stride-mismatch false-failure risk pre-commit. Fixed: explicit crop + diagnostic that distinguishes genuine pixel divergence from MB-padding stride artifacts. # Phase 5 sonnet code review Verdict: APPROVE-WITH-CHANGES. Three actionable findings, all addressed before this commit: 1. surface.c error path: separated OUTPUT-DQBUF-failure into error_buffer_indeterminate label, slot stays dead-busy 2. cap_pool_probe_pattern.c: added vaCreateContext to actually exercise the OUTPUT pool init at the small resolution 3. run_msync_pixel_verify.sh: explicit crop on HW path, stride-mismatch diagnostic distinguished from corruption Empirical verification (Phase 6+7 deploy + run): pending operator ohm-tools availability. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-06 06:49:48 +00:00
parent a09c03c154
commit 988b848908
6 changed files with 494 additions and 13 deletions
@@ -41,6 +41,7 @@ int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,

 	pool->count = count;
 	pool->next = 0;
+	pool->media_fd = media_fd;	/* iter7: kept for force_release re-alloc */

 	for (i = 0; i < count; i++)
 		pool->slots[i].request_fd = -1;
@@ -152,6 +153,62 @@ void request_pool_release(struct request_pool *pool, unsigned int index)
 	}
 }

+void request_pool_force_release(struct request_pool *pool, unsigned int index)
+{
+	struct request_pool_slot *slot;
+	unsigned int i;
+
+	if (pool == NULL || pool->slots == NULL)
+		return;
+
+	slot = NULL;
+	for (i = 0; i < pool->count; i++) {
+		if (pool->slots[i].index == index) {
+			slot = &pool->slots[i];
+			break;
+		}
+	}
+	if (slot == NULL)
+		return;
+
+	/*
+	 * Try to recover the kernel-side request object via REINIT first.
+	 * REINIT is the cheap path: kernel resets the request in place,
+	 * fd stays valid, slot can be reused immediately.
+	 */
+	if (slot->request_fd >= 0 && media_request_reinit(slot->request_fd) == 0) {
+		slot->busy = false;
+		return;
+	}
+
+	/*
+	 * REINIT failed (or slot's fd was already invalid). Close the fd
+	 * and try to allocate a fresh one. This costs an extra ioctl pair
+	 * relative to the REINIT happy path but keeps the slot usable.
+	 *
+	 * NOTE: alloc may return the same lowest-free fd number that was
+	 * just closed. That's fine here because (a) this is a rare error-
+	 * recovery path, not the per-frame happy path, and (b) the slot's
+	 * V4L2 buffer has already been DQBUF'd by this point (or is in an
+	 * indeterminate state we can't recover from regardless), so the
+	 * iter6 race condition (cross-slot fd-reuse against a kernel
+	 * buffer in mid-cleanup) does not apply.
+	 */
+	if (slot->request_fd >= 0)
+		close(slot->request_fd);
+	slot->request_fd = media_request_alloc(pool->media_fd);
+	if (slot->request_fd < 0) {
+		/*
+		 * Realloc failed. Slot is now permanently dead — leave
+		 * busy=true so acquire skips it. Pool capacity is
+		 * effectively reduced by 1 until pool destroy.
+		 */
+		return;
+	}
+
+	slot->busy = false;
+}
+
 struct request_pool_slot *request_pool_slot(struct request_pool *pool,
 					    unsigned int index)
 {