iter7: A+B+C — slot-leak fix, cap_pool harness, msync verify harness
Closes three internal carry items in one fork commit. iter6 deferred
these as TODOs; iter7 lands the implementations + supporting tests.
# Track B — slot-leak error recovery (src/)
iter6 documented the RequestSyncSurface error paths as a "bounded
leak we accept" — slots stayed busy=true after REINIT/DQBUF failures
until RequestTerminate ran. With pool=16 and rare errors this was
acceptable, but a sustained-error scenario could starve the pool.
Adds request_pool_force_release(pool, index) which:
1. Tries media_request_reinit on the slot's fd (cheap path)
2. Falls back to close + media_request_alloc (recovery)
3. Leaves the slot dead-busy if even alloc fails (other slots
unaffected, pool capacity reduced by 1 until destroy)
Wires it into surface.c RequestSyncSurface error paths only for
errors before the OUTPUT-DQBUF attempt. After OUTPUT-DQBUF failure
the V4L2 buffer is in indeterminate kernel state, so a separate
error label (`error_buffer_indeterminate`) leaves the slot
dead-busy — reusing the slot would QBUF on a kernel-still-held
buffer and EINVAL.
Phase 5 sonnet review caught this discriminator subtlety pre-commit.
Files: request_pool.{h,c}, surface.c.
# Track C — cap_pool race synthetic harness (tests/)
iter5 sonnet C4 / iter6 candidate A: cap_pool resolution-change
race was organically exercised by YT's quality renegotiations
(iter6 close, 4 cap_pool_init events clean) but had no
deterministic regression test.
tests/cap_pool_probe_pattern.c — ~170-line C program: opens
libva display, vaCreateConfig, vaCreateSurfaces(small) +
vaCreateContext (triggers OUTPUT pool init at small resolution),
dispose, vaCreateSurfaces(big) + vaCreateContext (forces S_FMT
on the new resolution against an in-use OUTPUT pool — the actual
race-hitting path).
Phase 5 sonnet flagged that without vaCreateContext the test
would pass trivially (OUTPUT pool never init'd, REQBUFS(0) on
empty queue is a no-op). Fixed before commit.
tests/run_cap_pool_probe.sh — runner; greps driver stderr for
REQBUFS / EBUSY / "Unable to set format" race indicators.
# Track A — msync pixel-correctness verify harness (tests/)
iter5 sweep removed msync(MS_SYNC|MS_INVALIDATE) from CAPTURE
DQBUF path. iter5 sonnet C3 flagged: no formal pixel verification.
tests/run_msync_pixel_verify.sh — runs FFmpeg SW decode (libavcodec
reference) and FFmpeg HW decode (via our v4l2_request driver),
compares NV12 byte streams. Probes fixture dimensions via ffprobe
and uses crop=$W:$H after hwdownload to normalize MB-padding
artifacts (hantro pads height to 16-line align; SW returns
crop-aligned).
Phase 5 sonnet flagged the stride-mismatch false-failure risk
pre-commit. Fixed: explicit crop + diagnostic that distinguishes
genuine pixel divergence from MB-padding stride artifacts.
# Phase 5 sonnet code review
Verdict: APPROVE-WITH-CHANGES. Three actionable findings, all
addressed before this commit:
1. surface.c error path: separated OUTPUT-DQBUF-failure into
error_buffer_indeterminate label, slot stays dead-busy
2. cap_pool_probe_pattern.c: added vaCreateContext to actually
exercise the OUTPUT pool init at the small resolution
3. run_msync_pixel_verify.sh: explicit crop on HW path,
stride-mismatch diagnostic distinguished from corruption
Empirical verification (Phase 6+7 deploy + run): pending operator
ohm-tools availability.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -41,6 +41,7 @@ int request_pool_init(struct request_pool *pool, int video_fd, int media_fd,
|
||||
|
||||
pool->count = count;
|
||||
pool->next = 0;
|
||||
pool->media_fd = media_fd; /* iter7: kept for force_release re-alloc */
|
||||
|
||||
for (i = 0; i < count; i++)
|
||||
pool->slots[i].request_fd = -1;
|
||||
@@ -152,6 +153,62 @@ void request_pool_release(struct request_pool *pool, unsigned int index)
|
||||
}
|
||||
}
|
||||
|
||||
void request_pool_force_release(struct request_pool *pool, unsigned int index)
|
||||
{
|
||||
struct request_pool_slot *slot;
|
||||
unsigned int i;
|
||||
|
||||
if (pool == NULL || pool->slots == NULL)
|
||||
return;
|
||||
|
||||
slot = NULL;
|
||||
for (i = 0; i < pool->count; i++) {
|
||||
if (pool->slots[i].index == index) {
|
||||
slot = &pool->slots[i];
|
||||
break;
|
||||
}
|
||||
}
|
||||
if (slot == NULL)
|
||||
return;
|
||||
|
||||
/*
|
||||
* Try to recover the kernel-side request object via REINIT first.
|
||||
* REINIT is the cheap path: kernel resets the request in place,
|
||||
* fd stays valid, slot can be reused immediately.
|
||||
*/
|
||||
if (slot->request_fd >= 0 && media_request_reinit(slot->request_fd) == 0) {
|
||||
slot->busy = false;
|
||||
return;
|
||||
}
|
||||
|
||||
/*
|
||||
* REINIT failed (or slot's fd was already invalid). Close the fd
|
||||
* and try to allocate a fresh one. This costs an extra ioctl pair
|
||||
* relative to the REINIT happy path but keeps the slot usable.
|
||||
*
|
||||
* NOTE: alloc may return the same lowest-free fd number that was
|
||||
* just closed. That's fine here because (a) this is a rare error-
|
||||
* recovery path, not the per-frame happy path, and (b) the slot's
|
||||
* V4L2 buffer has already been DQBUF'd by this point (or is in an
|
||||
* indeterminate state we can't recover from regardless), so the
|
||||
* iter6 race condition (cross-slot fd-reuse against a kernel
|
||||
* buffer in mid-cleanup) does not apply.
|
||||
*/
|
||||
if (slot->request_fd >= 0)
|
||||
close(slot->request_fd);
|
||||
slot->request_fd = media_request_alloc(pool->media_fd);
|
||||
if (slot->request_fd < 0) {
|
||||
/*
|
||||
* Realloc failed. Slot is now permanently dead — leave
|
||||
* busy=true so acquire skips it. Pool capacity is
|
||||
* effectively reduced by 1 until pool destroy.
|
||||
*/
|
||||
return;
|
||||
}
|
||||
|
||||
slot->busy = false;
|
||||
}
|
||||
|
||||
struct request_pool_slot *request_pool_slot(struct request_pool *pool,
|
||||
unsigned int index)
|
||||
{
|
||||
|
||||
@@ -49,6 +49,8 @@ struct request_pool {
|
||||
struct request_pool_slot *slots;
|
||||
unsigned int count;
|
||||
unsigned int next; /* round-robin acquire cursor */
|
||||
int media_fd; /* iter7: kept for
|
||||
* force_release re-alloc */
|
||||
bool initialized;
|
||||
};
|
||||
|
||||
@@ -79,6 +81,21 @@ int request_pool_acquire(struct request_pool *pool);
|
||||
*/
|
||||
void request_pool_release(struct request_pool *pool, unsigned int index);
|
||||
|
||||
/*
|
||||
* iter7: error-recovery release. Called from RequestSyncSurface error
|
||||
* paths when media_request_reinit or VIDIOC_DQBUF failed mid-cycle and
|
||||
* the slot's request_fd is now in an undefined state. REINITs the fd;
|
||||
* if REINIT fails (kernel-side request object too far gone), close
|
||||
* the fd and re-alloc a fresh one. If realloc also fails, the slot
|
||||
* is left busy=true (effectively dead, count decremented by 1) — pool
|
||||
* survives but with reduced capacity until driver terminate. Other
|
||||
* slots are unaffected.
|
||||
*
|
||||
* Caller passes the V4L2 buffer index from request_pool_acquire().
|
||||
*/
|
||||
void request_pool_force_release(struct request_pool *pool,
|
||||
unsigned int index);
|
||||
|
||||
/*
|
||||
* Look up the pool slot owning a given V4L2 buffer index. Returns
|
||||
* pointer to the slot on success, NULL if the index is out of range.
|
||||
|
||||
+57
-13
@@ -484,7 +484,14 @@ VAStatus RequestSyncSurface(VADriverContextP context, VASurfaceID surface_id)
|
||||
surface_object->source_index, 1);
|
||||
if (rc < 0) {
|
||||
status = VA_STATUS_ERROR_OPERATION_FAILED;
|
||||
goto error;
|
||||
/*
|
||||
* iter7: OUTPUT DQBUF failed. The V4L2 buffer is in an
|
||||
* indeterminate kernel state — it may still be QUEUED. Do
|
||||
* NOT return the slot to acquire-rotation: the next QBUF
|
||||
* on it would EINVAL. Leave source_data set so the error
|
||||
* handler skips force_release and the slot stays dead-busy.
|
||||
*/
|
||||
goto error_buffer_indeterminate;
|
||||
}
|
||||
|
||||
/*
|
||||
@@ -523,21 +530,58 @@ VAStatus RequestSyncSurface(VADriverContextP context, VASurfaceID surface_id)
|
||||
|
||||
error:
|
||||
/*
|
||||
* iter6: request_fd is owned by the OUTPUT pool slot. Do not
|
||||
* close here on error. The slot's REINIT may not have run if
|
||||
* we errored before it; the slot stays busy=true and is not
|
||||
* returned to acquire-rotation until RequestTerminate runs
|
||||
* request_pool_destroy. (The pool is driver-wide; it survives
|
||||
* RequestDestroyContext.) That's a bounded leak we accept: with
|
||||
* pool size 16 and rare errors, slot starvation only matters
|
||||
* after many error cycles — at which point acquire returns -1
|
||||
* cleanly.
|
||||
* iter7: error recovery for the OUTPUT pool slot. If the surface
|
||||
* acquired a slot in BeginPicture (source_data != NULL indicates
|
||||
* an active borrow), reset the slot's request_fd via
|
||||
* request_pool_force_release so the slot returns to the
|
||||
* acquire-rotation. force_release tries REINIT first; falls back
|
||||
* to close+alloc if REINIT fails; leaves the slot dead-busy if
|
||||
* even alloc fails (other slots unaffected). Replaces iter6's
|
||||
* accepted bounded leak.
|
||||
*
|
||||
* TODO: a future iteration could add a request_pool_force_release
|
||||
* that REINITs the fd and frees the slot for error recovery.
|
||||
* Reachable from: media_request_queue / wait_completion / REINIT
|
||||
* failures. NOT reachable for OUTPUT-DQBUF failure (separate label
|
||||
* `error_buffer_indeterminate` below) because in that case the
|
||||
* V4L2 buffer is in an indeterminate kernel state and reusing the
|
||||
* slot would EINVAL on the next QBUF.
|
||||
*
|
||||
* If the surface never acquired a slot (source_data == NULL),
|
||||
* there is no slot to release; nothing to do.
|
||||
*/
|
||||
if (surface_object != NULL)
|
||||
if (surface_object != NULL) {
|
||||
if (surface_object->source_data != NULL) {
|
||||
request_pool_force_release(&driver_data->output_pool,
|
||||
surface_object->source_index);
|
||||
surface_object->source_data = NULL;
|
||||
surface_object->source_size = 0;
|
||||
}
|
||||
surface_object->request_fd = -1;
|
||||
}
|
||||
goto complete;
|
||||
|
||||
error_buffer_indeterminate:
|
||||
/*
|
||||
* iter7: OUTPUT DQBUF failed after a successful REINIT. The kernel
|
||||
* V4L2 buffer is in an unknown state (possibly still QUEUED with
|
||||
* pending decode result, possibly half-dequeued, possibly stuck
|
||||
* in driver internals). The slot's request_fd has already been
|
||||
* REINIT'd to a clean state, but reusing the slot for a new
|
||||
* decode would QBUF on a buffer the kernel may still hold —
|
||||
* triggering exactly the iter6 race we eliminated for the happy
|
||||
* path.
|
||||
*
|
||||
* Leave the slot dead-busy: don't release, don't force_release.
|
||||
* Other slots are unaffected. If this fires repeatedly, the pool
|
||||
* leaks slots until starvation, at which point acquire returns -1
|
||||
* and BeginPicture cleanly propagates ALLOCATION_FAILED. This is
|
||||
* a strictly safer failure mode than reusing an indeterminate
|
||||
* V4L2 buffer.
|
||||
*/
|
||||
if (surface_object != NULL) {
|
||||
surface_object->source_data = NULL;
|
||||
surface_object->source_size = 0;
|
||||
surface_object->request_fd = -1;
|
||||
}
|
||||
|
||||
complete:
|
||||
return status;
|
||||
|
||||
Reference in New Issue
Block a user