iter6 fix: per-OUTPUT-slot request_fd binding via REINIT

iter4 (385dee1) replaced the original media_request_reinit pattern with close+media_request_alloc per frame to escape an EINVAL on S_EXT_CTRLS that turned out to be a DPB-payload bug (74d8dd1, FFmpeg V4L2_H264_FRAME_REF semantics). The per-frame close+alloc model worked for mpv vaapi-copy (single-surface recycle) but raced under Firefox 150's MediaSource pipeline (multi-surface rotation): fd=30 got reused via lowest-free-fd allocation faster than the kernel- side per-buffer state-machine could tear down the prior request, producing intermittent VIDIOC_QBUF EINVAL on OUTPUT after 1..53 successful frames. Phase 2 telemetry confirmed: - DQBUF returned the index we passed (no FIFO mismatch) - SPS/PPS/DECODE_PARAMS/SCALING_MATRIX byte-identical between mpv and Firefox first 64 bytes - Pool size bump 4 -> 16 only delayed the failure (62 frames) - Different OUTPUT slot indices failed across runs (race signature) Fix: each OUTPUT pool slot owns a permanent request_fd allocated once at request_pool_init and REINIT'd between uses in RequestSyncSurface. 1:1 slot-to-fd binding eliminates cross-slot fd reuse entirely. Pool stays driver-wide (multi-context safe per iter5 Track E); slots cycle through 16 distinct fds in round-robin acquire. Files: - request_pool.h: add request_fd field to slot struct; init signature takes media_fd - request_pool.c: alloc per-slot fd at init, close at destroy - context.c: pass driver_data->media_fd; pool size 4 -> 16 - picture.c: BeginPicture binds slot->request_fd to surface; EndPicture's per-frame media_request_alloc removed - surface.c: RequestSyncSurface uses media_request_reinit instead of close+alloc; DestroySurfaces close removed (slot owns fd); error path close removed; surface_object NULL-init for the -Wmaybe-uninitialized warning fix Empirical verification (clean build sha ebe396d5..., no diagnostic instrumentation): - Firefox 150 + bbb_1080p30_h264.mp4 + LIBVA_DRIVER_NAME=v4l2_request + sandbox enabled: 35s+ playback, zero "Unable to queue buffer" / "Unable to set control(s)", lsof shows RDD process holds /dev/video1 + /dev/media0 throughout. Driver stderr: only the single cap_pool_init: 24 slots ready line. - mpv vaapi-copy 50 frames: zero errors, "Using hardware decoding (vaapi-copy)" - no regression vs iter5-end driver. Pool-size bump diagnostic (Phase 5 sonnet design review feedback): 4 -> 16 alone took 1->62 frames, far short of the 30s success criterion (~900 frames at 30fps). REINIT discipline is the actual fix; pool 16 is comfortable headroom over typical H.264 MaxDpbFrames. Phase 5 sonnet code review: APPROVE-WITH-CHANGES (one comment attribution corrected: cleanup runs at RequestTerminate, not RequestDestroyContext, since the pool is driver-wide). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-05 21:30:39 +00:00
parent c8b6edec3d
commit a09c03c154
5 changed files with 92 additions and 26 deletions
@@ -394,8 +394,13 @@ VAStatus RequestDestroySurfaces(VADriverContextP context,
 		 */
 		surface_unbind_slot(driver_data, surface_object);

-		if (surface_object->request_fd > 0)
-			close(surface_object->request_fd);
+		/*
+		 * iter6: request_fd is owned by the OUTPUT pool slot, not by
+		 * the surface. Do not close here. The pool closes all slot
+		 * fds at request_pool_destroy time, which fires from
+		 * RequestTerminate (driver unload) — the pool is driver-wide
+		 * and survives context destroy/recreate cycles.
+		 */

 		object_heap_free(&driver_data->surface_heap,
 				 (struct object_base *)surface_object);
@@ -408,7 +413,7 @@ VAStatus RequestSyncSurface(VADriverContextP context, VASurfaceID surface_id)
 {

 	struct request_data *driver_data = context->pDriverData;
-	struct object_surface *surface_object;
+	struct object_surface *surface_object = NULL;
 	VAStatus status;
 	struct video_format *video_format;
 	unsigned int output_type, capture_type;
@@ -454,18 +459,26 @@ VAStatus RequestSyncSurface(VADriverContextP context, VASurfaceID surface_id)
 	}

 	/*
-	 * iter4: instead of REINITing for reuse, close the request_fd here
-	 * and force the next BeginPicture to allocate a fresh one. The cached
-	 * request_fd path was hitting EINVAL on S_EXT_CTRLS for some surface
-	 * recycle patterns (4 individual TRY_EXT_CTRLS on the same fd all
-	 * fail with EINVAL — the fd state is bad even though queue+wait+reinit
-	 * appeared successful). Allocating fresh per frame is unambiguous and
-	 * sidesteps any state-lifecycle issue. Tradeoff: ~one extra ioctl per
-	 * frame (MEDIA_IOC_REQUEST_ALLOC + close), negligible cost.
+	 * iter6: the request_fd belongs to the OUTPUT pool slot, not to the
+	 * surface. REINIT to reset its state in place — close+alloc would
+	 * reuse the lowest-free fd number against a kernel object whose
+	 * teardown hasn't fully drained, racing with QBUF on a slot that
+	 * was just released. The pool's 1:1 slot-to-fd binding eliminates
+	 * cross-slot fd reuse, and REINIT here resets the request object
+	 * for the next decode cycle on the same slot.
+	 *
+	 * Iter4's frame-11 EINVAL (which prompted the iter4 close+alloc
+	 * model) was a control-payload bug — DPB carry-over with FFmpeg's
+	 * V4L2_H264_FRAME_REF semantics not yet matched. That's been fixed
+	 * since iter4 (`74d8dd1`), so REINIT is no longer compromised by
+	 * the cluster-validation EINVAL pattern.
 	 */
-	close(request_fd);
+	rc = media_request_reinit(request_fd);
+	if (rc < 0) {
+		status = VA_STATUS_ERROR_OPERATION_FAILED;
+		goto error;
+	}
 	surface_object->request_fd = -1;
-	(void)0; /* placeholder for the now-removed reinit error path */

 	rc = v4l2_dequeue_buffer(driver_data->video_fd, -1, output_type,
 				 surface_object->source_index, 1);
@@ -509,10 +522,22 @@ VAStatus RequestSyncSurface(VADriverContextP context, VASurfaceID surface_id)
 	goto complete;

 error:
-	if (request_fd >= 0) {
-		close(request_fd);
+	/*
+	 * iter6: request_fd is owned by the OUTPUT pool slot. Do not
+	 * close here on error. The slot's REINIT may not have run if
+	 * we errored before it; the slot stays busy=true and is not
+	 * returned to acquire-rotation until RequestTerminate runs
+	 * request_pool_destroy. (The pool is driver-wide; it survives
+	 * RequestDestroyContext.) That's a bounded leak we accept: with
+	 * pool size 16 and rare errors, slot starvation only matters
+	 * after many error cycles — at which point acquire returns -1
+	 * cleanly.
+	 *
+	 * TODO: a future iteration could add a request_pool_force_release
+	 * that REINITs the fd and frees the slot for error recovery.
+	 */
+	if (surface_object != NULL)
 		surface_object->request_fd = -1;
-	}

 complete:
 	return status;