Iter2 Phase 5 sonnet review: 3 architecture blockers + CAPTURE REQBUFS gap

Plan subagent with model: sonnet, open-consultation. Raw verbatim
review text — operator transcribes to DokuWiki.

Substantive findings beyond the iteration 2 plan as written:

ARCHITECTURE BLOCKERS (must address before Fix 3 implementation):

1. The 3-state slot machine (FREE/IN_DECODE/EXPORTED) is incomplete.
   Need a 4th state DECODED for the window between SyncSurface DQBUF
   and either ExportSurfaceHandle (vaapi path) or DeriveImage
   (vaapi-copy path). Without it, vaapi-copy races Fix 3 — a slot
   in FREE state can be claimed by BeginPicture for a new frame
   while DeriveImage is still copying from its mmap.

2. pthread_mutex_t required around cap_pool slot state. VAAPI is
   re-entrant for multi-threaded consumers; mpv vaapi may call
   EndPicture/SyncSurface from decoder thread and ExportSurfaceHandle
   from display thread.

3. The patch-0011 sentinel write at picture.c:365-373 currently uses
   surface_object->destination_map[0]. After pool refactor, must
   reference slot->map[0] instead.

ADDITIONAL FINDINGS (addressed in Fix 1):

4. The resolution-change path at surface.c:121-123 only does REQBUFS(0)
   on OUTPUT, not CAPTURE. Hantro derives CAPTURE format from OUTPUT,
   so leftover CAPTURE buffers from prior resolution would also block
   the implicit format change. Pre-existing bug; Fix 3 pool refactor
   would expose more frequently. Folded into Fix 1 implementation.

5. WSI fix should be conditional on pitch alignment, not universal
   MOD_INVALID — preserves LINEAR semantics for already-aligned
   1920-wide content.

DEFERRED to iteration 3:

- LAST_OUTPUT_WIDTH/HEIGHT thread safety (static globals; multi-context race)
- Full Option B (V4L2_MEMORY_DMABUF + userspace allocation)
- Multi-context concurrent use

Sonnet endorsed Option A (more buffers + LRU recycling) over Option B
for iteration 2's hardening scope. MIN_CAP_POOL=24 conservatively sized
for single-stream playback (verified against Firefox VideoFramePool
source for hold-window estimates).

Fix 1 + Fix 2 already implemented per revised architecture and pushed
to fork (06beef6, e64bb08). Fix 3 implementation pending ohm
availability for verification.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
2026-05-04 19:39:08 +00:00
parent 9b1d7737cd
commit 3fd51d5e7e
+294
View File
@@ -0,0 +1,294 @@
# Iteration 2 Phase 5 — sonnet review (verbatim)
Reviewer: Plan subagent with `model: sonnet`. Read all artifacts cold.
---
# Iteration 2 Plan — Phase 5 Second-Model Review
**Reviewer scope**: this is a pre-implementation review of the Phase 4 plan. No Phase 6 code exists yet. The review reads all artifacts cold, without deference to the plan's framing.
---
## Question 1: Phase 0 substrate — are the three issues correctly grouped?
The framing holds, but with a qualification on the grouping rationale.
The plan treats the three issues as thematically related via the shared claim "V4L2 per-buffer state is decoupled from dma_buf refcount." That claim is correct for Fix 3. It is only incidentally related to Fix 2 (WSI alignment) and largely unrelated to Fix 1 (multi-resolution cache). The three are co-residencing in iteration 2 because they all manifested after the decode fix landed, not because they share a mechanism.
That is acceptable grouping — "everything we know about that blocks a clean vaapi + Firefox session" — but the Phase 2 analysis slightly oversells the connection. The WSI alignment issue is entirely independent: it would exist even if the buffer lifecycle were correct. The multi-resolution corruption is a caching bug in userspace state, not a dma_buf lifecycle issue.
The practical concern with conflating them is that Fix 2 and Fix 3 interact at the exported descriptor level. The plan does recognize they are tested independently. The interaction point to watch: if Fix 3 changes how `destination_index` is bound at export time, the per-buffer `bytesperline` must travel with the slot, not with the old surface fields. The plan's `cap_pool_slot` struct carries `map_lengths` and `map_offsets` but not `bytesperline`. That is the one piece of the grouping where the two fixes touch and the plan is underspecified (more below under question 6).
Recommendation: keep all three in scope. They are correctly bounded. The mechanical grouping is fine; the theoretical grouping is slightly inflated. Does not need correction before implementation, just worth noting to avoid over-engineering the shared infrastructure.
---
## Question 2: Is Option A (more buffers + LRU) acceptable, or should iteration 2 do Option B?
Option A is the right call for iteration 2. The plan's framing of it as "statistical mitigation, not a correct fix" is accurate, and the rationale for accepting it is sound. Let me sharpen what the actual risk model looks like.
The race window is: `compositor hold time` vs `pool cycle time`. At MIN_CAP_POOL=24, pool cycle time for a 24-fps stream is approximately `24 / fps * buffer_count = 24 / 24 * 24 = 24 frames worth = ~1 second`. Compositor hold time at 60 fps with a 3-frame compositor pipeline is ~50ms. The margin is 20:1. For the canonical bbb playback test, Option A is statistically safe.
Option B (V4L2_MEMORY_DMABUF + userspace allocation) requires gbm or a similar allocator, inversion of the buffer ownership model, and a rewrite of the mmap chain. It is correct but it is iteration 3+ work. The plan correctly defers it.
One honest addition the plan does not state: Option A's "close our fd" step in BeginPicture is necessary but not sufficient. The plan says "close OUR copy of the EXPBUF'd fd" when force-recycling an EXPORTED slot. But `v4l2_export_buffer` always creates a new fd on each call (EXPBUF is not refcount-addref; it issues a new file description). The consumer holds an independent fd to the same underlying dma_buf. Closing our fd does not close the consumer's fd and does not reduce the kernel's dma_buf refcount to zero. The close of our copy is correct housekeeping but it is not what makes the slot safe to recycle. The slot becomes safe to recycle only when the consumer closes its fd. Option A's LRU timestamps are a proxy for this — the assumption is "enough time has passed that the consumer probably released it." That proxy is explicitly acknowledged in the plan. The implementation should comment this clearly so a future maintainer does not believe closing our fd is the correctness guarantee.
---
## Question 3: Can pool exhaustion happen in practice with MIN_CAP_POOL=24?
For single-stream mpv with a 16-surface VAAPI pool: no, not in steady state. Here is the accounting:
- 16 surfaces in mpv's pool.
- At any instant: at most 1 surface IN_DECODE (kernel busy), at most 3-4 surfaces EXPORTED and held by compositor, rest FREE.
- The pool has 24 slots. At worst, 16 surfaces have been exported at least once and 4-5 are simultaneously EXPORTED. That leaves 8+ slots in FREE state.
The exhaustion scenario the plan identifies — pause-then-resume or multi-stream — is real. If mpv is paused, the compositor may hold the last-displayed frame indefinitely. With 24 slots and mpv's 16-surface pool, a 16-second pause would cycle through at most 16 EXPORTED slots before the compositor's Wayland surface sync releases them. In practice, Wayland compositors (KWin Wayland) release buffer references within one composition cycle after the surface is unmapped or the window is minimized — typically within 16ms. A paused window still visible on screen will hold exactly one frame, not a growing pile. So exhaustion on pause is very unlikely with 24 slots.
The multi-stream case (two mpv instances simultaneously) is out of scope per Phase 0's locked boundary. Firefox decodes via its own pipeline with GBM-backed surfaces and DUPs the fd (line 1515: `auto rawFd = dup(aDesc.objects[object].fd)`), then holds the dup. The dup is held while `mHoldByFFmpeg` is true. `ReleaseVAAPIData` unrefs it. The key question is whether Firefox can be holding more than one fd per surface simultaneously. The answer from the source is: no, one surface gets one DMABufSurface wrapper, and `ImportPRIMESurfaceDescriptor` `MOZ_DIAGNOSTIC_ASSERT(!mDmabufFds[0])` guards against double-import. Firefox's own surface pool is separate from ours; its hold window is controlled by `mHoldByFFmpeg` + `IsUsedByRenderer()` (which polls `mGlobalRefCountFd` via `poll()`).
So exhaustion is not a realistic concern for single-stream playback. The plan's MIN_CAP_POOL=24 is conservatively sized.
---
## Question 4: Is DRM_FORMAT_MOD_INVALID the right modifier choice for WSI alignment?
This is the weakest point in the plan, and it needs more precision before implementation.
The plan's reasoning is: "DRM_FORMAT_MOD_INVALID tells Mesa 'modifier unknown, treat as implicit / texture-import-only.'" That is a reasonable reading of the linux-dmabuf-unstable-v1 spec comment at line 348-356 of `wayland/linux-dmabuf-unstable-v1-client-protocol.h` in the Firefox reference, which says DRM_FORMAT_MOD_INVALID means "implicit modifier." But the actual behavior depends on which Mesa code path handles the import.
The Firefox source shows the answer directly. At line 1920 of `DMABufSurface.cpp`:
```c
if (mBufferModifiers[aPlane] != DRM_FORMAT_MOD_INVALID) {
attribs.AppendElement(LOCAL_EGL_DMA_BUF_PLANE##plane_idx##_MODIFIER_LO_EXT);
...
}
```
When the modifier IS `DRM_FORMAT_MOD_INVALID`, Firefox omits the modifier attributes from the `eglCreateImage` call entirely. This is the correct path for an "implicit modifier" import — the EGL driver will not check WSI alignment constraints against a known modifier, it will instead handle it as an unmodified linear buffer. This should bypass the "WSI pitch not properly aligned" error, because that error comes from Mesa's WSI/swapchain code path, not from the EGL texture import path.
However, there is a separate concern: for the mpv COMPOSED_LAYERS path (not Firefox's SEPARATE_LAYERS path), mpv goes through a different Mesa import path. mpv calls `vaExportSurfaceHandle` and passes the descriptor to Mesa's VA-API frontend, which then calls `eglCreateImage` with the attributes from our descriptor. The modifier attribute presence/absence follows the same pattern — if we set MOD_INVALID in `objects[i].drm_format_modifier`, Mesa's VA frontend will also omit the modifier from EGL attribs.
The specific WSI pitch alignment error is generated by Mesa's Panfrost WSI code (for Mali GPU). This is the path that fires: Mesa's WSI swapchain (not texture import) requires GPU-aligned buffers because the scanout engine has alignment requirements. The "WSI pitch not properly aligned" error comes from Mesa checking whether the buffer can be used as a scanout buffer, not as a texture. Setting MOD_INVALID would tell Mesa "I don't know the modifier, don't assume scanout-compatible" — which is exactly right.
What the plan does not address: is there a Mesa/Panfrost fallback path that succeeds for MOD_INVALID but produces correct pixel output? There should be: MOD_INVALID in the import causes Mesa to texture-upload rather than DMA-import for scanout. The plan acknowledges the perf cost; the correctness should be fine.
One risk not mentioned: `video_format_is_linear()` in our code returns `format->drm_modifier == DRM_FORMAT_MOD_NONE`. If we change the export modifier to MOD_INVALID without touching this function, the `VA_SURFACE_ATTRIB_MEM_TYPE_DRM_PRIME` (first-generation DRM prime, non-modifier) capability advertised in `QuerySurfaceAttributes` will still be set (it only checks `video_format_is_linear()`). That is not a problem for iteration 2 but is a latent inconsistency.
More important: changing the modifier globally (in `video.c` by changing `DRM_FORMAT_MOD_NONE` to `DRM_FORMAT_MOD_INVALID`) would break the mpv vaapi-copy path if mpv vaapi-copy also goes through an EGL import that relies on the modifier being LINEAR. Check whether mpv's vaapi-copy calls `vaExportSurfaceHandle` — reading `image.c::copy_surface_to_image` shows it reads `destination_data[i]` via mmap (line 169-171), it never calls vaExportSurfaceHandle. So vaapi-copy is unaffected. Good.
The change is therefore safe to make globally at the export level, not needing per-consumer branching. But for clarity, do it at the export site (`RequestExportSurfaceHandle`) rather than in `video.c`, so the global format descriptor doesn't carry a modifier that doesn't match the actual physical allocation (which is genuinely linear, just not GPU-aligned for scanout).
Concrete recommendation for implementation: in `RequestExportSurfaceHandle`, when setting `surface_descriptor->objects[i].drm_format_modifier`, compute the value conditionally:
- If bytesperline is not 64-aligned: use `DRM_FORMAT_MOD_INVALID`.
- Else: use `video_format->drm_modifier` (which is `DRM_FORMAT_MOD_NONE` = LINEAR).
This avoids regressing 1920-wide content (which is 64-aligned) that currently works with mpv vaapi. The plan's "try MOD_INVALID universally first" is a valid test-first approach, but the implementation should be this conditional.
One further unresolved question the plan raises but does not answer: does Mali/Panfrost's EGL implementation on this device actually implement the `EGL_DMA_BUF_PLANE0_MODIFIER_LO/HI_EXT` path, and does MOD_INVALID without modifier attribs produce a valid EGLImage for Panfrost? This is a live-test question that cannot be answered from source reading alone. The plan's "1 line, test immediately" approach is correct. If MOD_INVALID causes `eglCreateImage` to fail entirely on this Panfrost driver, the fallback is to round up pitch in the descriptor (sub-option WSI-1) with a comment about the aliasing.
---
## Question 5: Multi-resolution cache invalidation — are there missing reset paths?
The plan identifies two reset triggers: REQBUFS(0) in the resolution-change path, and RequestDestroyContext. Reading the actual code in `context.c:203-251`:
`RequestDestroyContext` does:
1. STREAMOFF output + capture
2. RequestDestroySurfaces
3. REQBUFS(0) on output
4. REQBUFS(0) on capture
It does NOT touch `LAST_OUTPUT_WIDTH/HEIGHT`. The proposed Fix 1 adds the invalidation here. Good.
But the plan misses a critical structural problem: `LAST_OUTPUT_WIDTH` and `LAST_OUTPUT_HEIGHT` are `static` process-globals in `surface.c` (lines 69-70). The `request_data` struct is per-driver-context, but these statics are per-process-load-of-the-shared-library. In a multi-context scenario (two libva contexts open simultaneously), they are shared. More concretely: when Firefox creates a context for video A and another context for video B at different resolutions, they share these globals. Whichever CreateSurfaces2 runs last wins. This is the 7.3 issue from iteration 1 ("multi-context safety not verified").
Iteration 2 says 7.3 is addressed by the multi-resolution work. But the fix as described — "zero out LAST_OUTPUT_WIDTH/HEIGHT in RequestDestroyContext" — only helps for sequential sessions, not concurrent ones. For iteration 2's scope ("multi-video Firefox session"), sequential is the relevant case (Firefox in RDD process has one video at a time). But the plan should state this explicitly rather than leaving readers to assume the fix covers concurrent use.
Are there other missing reset paths?
STREAMOFF is not currently a reset trigger. Looking at how STREAMOFF is called: `v4l2_set_stream(..., false)` is called in DestroyContext before REQBUFS(0). Does STREAMOFF reset the kernel's CAPTURE format to 48×48? The hypothesis in Phase 2 says the format reverts on REQBUFS(0). Reading the hantro_v4l2.c source: REQBUFS(0) on the capture queue calls `vb2_core_reqbufs`, which (in videobuf2-core.c) calls `__vb2_queue_free``__vb2_queue_cancel` → this does NOT reset the format. The V4L2 specification says the format is not reset by REQBUFS. So where does the 48×48 regression come from?
The Phase 2 analysis hypothesizes "the kernel CAPTURE format reverts to 48×48 on REQBUFS(0) at session end." But V4L2 spec and kernel code say REQBUFS(0) does not reset format. The actual reset mechanism is more likely: after REQBUFS(0), the driver internally resets to the default minimum format because the next S_FMT call on the OUTPUT queue (at new session start) sets the OUTPUT format, and hantro derives CAPTURE format from OUTPUT format. But our code does NOT call S_FMT on CAPTURE (the comment at surface.c line 157-170 explicitly says not to). So after a DestroyContext, the CAPTURE format reflects the last OUTPUT S_FMT — which may or may not be 48×48 depending on whether the resolution-change path ran.
Actually, rereading: the 48×48 corruption in Firefox is described as "after 864→1920→..." sequence. The probe surfaces are 128×128 or similar; hantro's minimum may be 48×48. If the probe surface S_FMT runs at 48×48 and the cache says "already set", the format stays at 48×48. After DestroyContext and REQBUFS(0), the kernel format stays at whatever it last was. The next CreateSurfaces2 at the real resolution checks the cache, sees "0,0" (after invalidation), and re-S_FMTs. That is the correct flow.
So the fix is: invalidate cache on DestroyContext. The REQBUFS(0) in the resolution-change path (surface.c line 122) already handles mid-session changes correctly by then re-S_FMTting. The missing case is cross-session state persistence, which DestroyContext invalidation addresses.
One missing reset path that the plan does not mention: the REQBUFS(0) at surface.c line 122 (resolution change mid-session, line `v4l2_request_buffers(driver_data->video_fd, output_type, 0)`) is only on the OUTPUT queue, not CAPTURE. After this, the code calls `v4l2_set_format` on OUTPUT and then `v4l2_get_format` on CAPTURE. The CAPTURE format is derived from the new OUTPUT format by hantro. This should work correctly, but verify that REQBUFS(0) on CAPTURE is not also needed before S_FMT — the comment says "S_FMT rejected by V4L2 while buffers exist", and `v4l2_create_buffers` creates CAPTURE buffers later. If there are existing CAPTURE buffers from a previous resolution that haven't been torn down, would S_FMT on OUTPUT (which derives CAPTURE format) fail? The code does not explicitly REQBUFS(0) on CAPTURE in this path. This is a latent bug potentially separate from Fix 1, but triggered by the same multi-resolution pattern.
Recommendation: also REQBUFS(0) on CAPTURE in the resolution-change path (surface.c line ~115-132) before re-S_FMTting. Audit whether this is currently done. If not, Fix 1 is incomplete for the mid-session case.
---
## Question 6: Edge cases in Fix 3 (buffer pool refactor)
**vaapi-copy and `destination_data[i]`**
The vaapi-copy path uses `copy_surface_to_image` which reads `surface_object->destination_data[i]` (image.c line 170). These pointers are set in `RequestCreateSurfaces2` at surface.c lines 251-252 and 261-263. They point into the mmap of the permanently-bound CAPTURE buffer.
With the pool refactor, `destination_data[i]` cannot live in `surface_object` anymore — it must move to `cap_pool_slot`. But `copy_surface_to_image` still calls `surface_object->destination_data[i]`. Either:
(a) The surface object retains a pointer to its currently-bound slot's mmap, updated on every BeginPicture binding, or
(b) `copy_surface_to_image` is refactored to look up the slot via `surface_object→current_slot`.
The plan's pseudocode for `cap_pool_slot` carries `map[VIDEO_MAX_PLANES]`, which is the mmap pointer. The plan says "the bound slot's mmap is valid at copy time" as an assertion. For this to hold, the slot must remain bound through SyncSurface and until the copy is done in DeriveImage/GetImage. The current call sequence for vaapi-copy is:
`EndPicture``SyncSurface` (QBUF, wait, DQBUF) → surface status = `VASurfaceDisplaying` → consumer calls `vaDeriveImage``copy_surface_to_image`.
In the pool model: at `EndPicture`, the slot is IN_DECODE. After DQBUF in SyncSurface, the slot becomes... what? The plan says FREE after DQBUF, EXPORTED after ExportSurfaceHandle. But for vaapi-copy, Export is never called — the consumer copies via DeriveImage instead. So the slot would be in FREE state when `copy_surface_to_image` runs.
If a concurrent decode happens between SyncSurface and DeriveImage, the slot could be claimed by BeginPicture for a new frame. This is the specific race the plan needs to address for vaapi-copy.
The correct fix: the slot should stay bound to the surface (not returned to FREE) until either Export is called (and returns EXPORTED) or the consumer indicates it's done (which for vaapi-copy is after DeriveImage completes). A slot state of DECODED (between DQBUF and either Export or DeriveImage) would capture this. The plan's three-state enum (FREE, IN_DECODE, EXPORTED) is missing this state.
This is not a theoretical concern — mpv vaapi-copy is a validated consumer and must not regress.
**Concurrent decode + export (vaSyncSurface → vaExportSurfaceHandle)**
The plan's state machine: FREE → IN_DECODE (QBUF in EndPicture) → ??? after DQBUF in SyncSurface → EXPORTED (after ExportSurfaceHandle).
In the current code, EndPicture calls SyncSurface directly (picture.c line 411: `status = RequestSyncSurface(...)`). So the sequence within a single BeginPicture/EndPicture cycle is synchronous: QBUF → wait → DQBUF → status=VASurfaceDisplaying. The slot is DQBUF'd before EndPicture returns. Then mpv calls vaSyncSurface again (which is a no-op since status is already VASurfaceDisplaying), then vaExportSurfaceHandle.
The transitions are effectively: BeginPicture binds slot (FREE→IN_DECODE), EndPicture QBUFs and waits and DQBUFs synchronously (still IN_DECODE through the whole EndPicture), SyncSurface is a no-op on second call, ExportSurfaceHandle marks EXPORTED.
So the IN_DECODE → EXPORTED transition is not split across two functions in the current flow. The missing state (between DQBUF and Export) is real though — add DECODED state as described above.
Locking: the plan does not address threading. libva may be called from multiple threads (the VA-API spec says implementations must be re-entrant if the consumer uses separate threads). mpv in vaapi mode may call EndPicture/SyncSurface from the decoder thread and ExportSurfaceHandle from the display thread. The pool slot state must be protected by a mutex. The OUTPUT pool's `request_pool` uses no locks — and this is not currently a problem because the OUTPUT pool is accessed only in a single decode thread for mpv. The CAPTURE pool in Fix 3 may face the same threading exposure if Export and BeginPicture are called from different threads. The plan does not address this at all.
Recommendation: add a mutex to `cap_pool` (equivalent to what Firefox's `VideoFramePool` does with `mSurfaceLock`). Use a simple `pthread_mutex_t` in `request_data`. This is not optional if multi-threaded consumers exist.
**Failed decode: slot left in IN_DECODE**
If QBUF succeeds but the decode fails (media request returns error, or DQBUF times out), the slot is IN_DECODE with no path back to FREE. The plan's recovery needs to handle this. The plan does not mention it.
Recovery: in the error path of SyncSurface (surface.c lines 472-476), after `request_fd` is closed, the current code does not touch the slot. In the pool model, the error path must:
1. Call VIDIOC_STREAMOFF to flush the CAPTURE queue (or trust that the slot is reclaimed by kernel on REQBUFS(0)), OR
2. Mark the slot as IN_DECODE_FAILED and let BeginPicture treat it as recyclable.
The simplest recovery: in SyncSurface error path, mark the slot FREE (the kernel buffer is no longer queued after an error — either the request failed and DQBUF drained it, or it's stuck). Add a `v4l2_request_buffers(0)` + `v4l2_request_buffers(N)` cycle to reset if error recovery is needed. This is edge-case territory but should be documented.
**vaDestroySurfaces with an EXPORTED slot**
The plan asks this question but does not answer it in the pseudocode. The correct behavior: close OUR export fd (our EXPBUF fd), do NOT call REQBUFS(0) until ALL surfaces are destroyed. The dma_buf kernel object lives on as long as any fd (ours or consumer's) is open. When we close our fd and the consumer's fd is closed later, the kernel cleans up. Consumer's process does not crash — the fd it holds is still valid as long as it holds it; the underlying memory is reference-counted by the kernel.
This is correct behavior, but the implementation must not munmap the slot's mmap before the consumer closes its fd. Since munmap is per-slot (in RequestDestroySurfaces or pool teardown), and the consumer holds its own fd, the mmap and the consumer's fd are independent. munmap terminates the mmap mapping in our address space; the physical pages remain until the dma_buf's last fd is closed. So munmap on our side is safe even if consumer holds its fd. This is fine.
However: REQBUFS(0) is called in DestroyContext after RequestDestroySurfaces. If a surface is in EXPORTED state when DestroyContext runs, REQBUFS(0) should be safe — the kernel's vb2_queue_free path handles the cleanup of individual buffer state, and the dma_buf refcount for the consumer's fd is independent. The consumer's fd does not become invalid; the V4L2 buffer object is torn down but the dma_buf survives. This is the correct and documented behavior. The plan correctly says "close our_export_fd and let the dma_buf live on."
---
## Question 7: Is this review at the right granularity?
No — this review is missing a critical artifact that would change the risk profile significantly.
The biggest uncertainty in Fix 3 is the state machine transition sequence and the mmap ownership model during pool refactor. The plan's pseudocode is at the level of "what fields go in cap_pool_slot and what functions touch it." What's missing is a state table and the mapping of existing `surface_object` fields to the new model.
Specifically: which `surface_object` fields become INVALID after the pool refactor, and what is the access pattern of each remaining consumer of those fields? The fields that change:
- `surface_object->destination_index`: was permanent, becomes transient (bound in BeginPicture, valid through SyncSurface and ExportSurfaceHandle, undefined after that)
- `surface_object->destination_map[j]`: was permanent mmap, becomes slot's mmap pointer (valid only while slot is bound)
- `surface_object->destination_data[j]`: was permanent pointer, must follow slot binding
- `surface_object->destination_map_lengths[j]`, `destination_map_offsets[j]`: move to slot
- `surface_object->destination_sizes[j]`, `destination_offsets[j]`, `destination_bytesperlines[j]`: must either stay in surface (if fixed by format) or move to slot (if per-buffer)
These are separate questions: `destination_bytesperlines[j]` comes from `v4l2_get_format` at CreateSurfaces2 time (surface.c line 196-198). Since all buffers of the same format have the same bytesperlines, this CAN stay in the surface or the slot — doesn't matter which, as long as it's accessible at ExportSurfaceHandle time.
`destination_sizes[j]` and `destination_offsets[j]` are computed from bytesperlines and format_height (surface.c lines 241-256). These also belong to the format, not the specific buffer. They can stay in the surface object even after the pool refactor.
What MUST move to the slot: the mmap pointer (`map[j]`), map_lengths, map_offsets (for the mmap call). The slot's `v4l2_index` is the binding key.
The review WOULD benefit from pseudocode for BeginPicture (slot acquisition) and RequestDestroySurfaces (slot teardown). Ask for it before implementation starts. Not necessarily before this review is finalized, but before Phase 6 begins.
---
## Question 8: Scope check — is iteration 2 overpacked?
Iteration 2 has three fixes:
- Fix 1: ~5 lines (cache invalidation)
- Fix 2: ~3 lines (modifier change)
- Fix 3: ~150-200 lines (pool refactor)
The iteration 1 close had four load-bearing commits total (9de1be3, d41a4b9, 37c0e72, ac891a0), two of which were in the same scope area as Fix 3. The iteration 2 work is similar in complexity to iteration 1.
Fix 1 and Fix 2 together are a morning's work. Fix 3 is 1-3 days. The scope is manageable. Do not defer Fix 1 and Fix 2 — they are quick wins that unblock the Firefox multi-resolution test case, which is needed to validate the baseline before Fix 3 changes anything. Fix 1 and Fix 2 should land first, with verification, before Fix 3 starts. The plan already specifies this order.
The six deferred items from iteration 1 (sections 7.1-7.6) are correctly out of scope. Do not let them drift back in during Fix 3 implementation. Specifically: 7.3 (multi-context safety for LAST_OUTPUT_WIDTH) is partially addressed by Fix 1 but not fully — explicitly call this out so it is not marked "done" after iteration 2.
---
## Question 9: What specific items might be missed
**9.1 The CAPTURE REQBUFS(0) gap in the resolution-change path**
Surface.c line 121-123: on resolution change, the code does `v4l2_request_buffers(video_fd, output_type, 0)` (OUTPUT only). There is no corresponding REQBUFS(0) on CAPTURE. If CAPTURE buffers exist from a prior resolution, they are not freed. V4L2 will reject `CREATE_BUFS` on CAPTURE if the CAPTURE queue still has buffers from the prior session (actually `CREATE_BUFS` permits adding buffers; S_FMT is what's rejected if buffers exist). The comment in the code says "S_FMT rejected by V4L2 while buffers exist" but the code only does S_FMT on OUTPUT, not CAPTURE. Whether CAPTURE S_FMT is implicitly triggered by the OUTPUT S_FMT in hantro's stateless mode needs a live test. This is a pre-existing issue that Fix 3 may expose — pool refactor will change when REQBUFS(0) on CAPTURE is called.
**9.2 `destination_buffers_count` stays on surface_object but QBUF uses it**
At picture.c line 375-378, `v4l2_queue_buffer` is called with `surface_object->destination_buffers_count`. In the pool refactor, the QBUF is still on the slot's `v4l2_index`. The `destination_buffers_count` (number of planes in V4L2 multi-planar) is a property of the video format, not the specific buffer. It stays in `surface_object` or moves to the driver-level format — either way, it's available. Not a bug risk, just needs to be explicit.
**9.3 The sentinel write in EndPicture (picture.c lines 365-373) uses `destination_map[0]`**
After the pool refactor, `destination_map[0]` is in the slot, not the surface. The sentinel write must reference `slot->map[0]`. This is a correctness issue if the sentinel write remains after Fix 3 (it should, since the debug commits stay until iteration 2 is closed per the plan).
**9.4 `request_pool_acquire` returns the V4L2 buffer INDEX, not the slot index**
Looking at request_pool.c line 111: `return (int)pool->slots[slot].index;` — it returns the V4L2 buffer index (which is `index_base + slot_position`). The caller at picture.c line 245: `slot_index = request_pool_acquire(...)` and at 249: `slot = request_pool_slot(&driver_data->output_pool, (unsigned int)slot_index)`. The `request_pool_slot` function searches by `pool->slots[i].index == index` (line 141), so it finds the slot by V4L2 buffer index. This works but is confusing naming — `slot_index` is actually a V4L2 buffer index.
For Fix 3, the CAPTURE pool's `acquire` function should follow the same pattern for consistency, and the plan's pseudocode should clarify this. The LRU search loop should scan all slots and return the one with oldest `last_used_at_ns` — this is a scan, not a simple ring-buffer acquire like the OUTPUT pool. The output pool's `pool->next` ring pattern is not appropriate for LRU; the cap pool needs a linear scan.
**9.5 The SEPARATE_LAYERS export (Firefox path) exports per-layer fds via VIDIOC_EXPBUF**
Looking at surface.c line 644-650: `v4l2_export_buffer` is called with `surface_object->destination_index` and `export_fds_count = surface_object->destination_buffers_count`. For NV12 in single-buffer mode (v4l2_buffers_count=1), `export_fds_count = 1`. The single fd covers both Y and UV planes via offsets in the descriptor.
After Fix 3, `destination_index` is a transient slot binding. `ExportSurfaceHandle` must be called AFTER SyncSurface (which DQBUFs the slot). The plan accounts for this: slot state after SyncSurface DQBUF → Export marks it EXPORTED. But there is one subtle issue: the plan says "close OUR copy of the EXPBUF'd fd, mark slot as FREE" at BeginPicture time (when the surface is reused). The plan does not say WHERE our copy of the EXPBUF fd is stored. The plan's `cap_pool_slot` has `our_export_fd` field. We call `VIDIOC_EXPBUF` in ExportSurfaceHandle, get a new fd, pass it to the consumer, and also save it as `slot->our_export_fd`. At BeginPicture for the next decode into the same surface, we close `slot->our_export_fd`. This is the plan.
What about the case where ExportSurfaceHandle is called multiple times for the same surface (e.g., consumer calls Export twice)? Currently, each call creates a new EXPBUF fd. After Fix 3, each call would issue a new EXPBUF, get a new fd, pass it to consumer. Our `our_export_fd` would be overwritten. The old `our_export_fd` would leak. This is a pre-existing minor issue but becomes more visible with pool refactor. Recommendation: track export state as "at most one outstanding" and return an error (or assert) on double-Export. VAAPI spec allows the export to be called repeatedly.
**9.6 The static `LAST_OUTPUT_WIDTH/HEIGHT` is not thread-safe**
Both writes (surface.c lines 130-131) and reads (line 115) of these globals are done without any synchronization. If two threads call `RequestCreateSurfaces2` concurrently (which VAAPI allows for multi-context use), there is a TOCTOU race on these globals. Fix 1 adds more reads and writes (zeroing in DestroyContext). This pre-existing issue is not addressed by the plan. For iteration 2's scope (single-context sequential playback), it does not matter. For iteration 3's multi-context work, it will need a mutex or migration to per-context state. Flag this explicitly as a known gap.
**9.7 `RequestBeginPicture` currently calls `RequestSyncSurface` if surface is in `VASurfaceRendering` state (picture.c line 235-236)**
After the pool refactor, this implicit sync before reuse is where "close our export fd" would also need to happen — if the surface being reused in BeginPicture has EXPORTED state, the slot needs cleanup. The plan says this cleanup happens in BeginPicture: "If surface_X has a previously-bound slot in EXPORTED state: close OUR copy of the EXPBUF'd fd, mark slot as FREE." This is correct, but BeginPicture currently only handles `VASurfaceRendering` → sync. The plan needs to extend this to also handle `VASurfaceDisplaying` (which is the state after SyncSurface + Export). The surface state machine: rendering → displaying after SyncSurface, displaying → what? The vaapi-copy path sets status back to VASurfaceReady after DeriveImage (image.c line 216). The DMA-BUF export path does not reset status — it stays VASurfaceDisplaying. BeginPicture on a VASurfaceDisplaying surface would not trigger the sync (sync is only for VASurfaceRendering). The slot cleanup at BeginPicture for an EXPORTED slot is independent of the surface status check. Make sure the implementation checks slot state (not surface status) for the cleanup.
---
## Summary and ranking of issues by implementation risk
**MUST address before Phase 6 starts (architecture blockers)**:
1. Add `DECODED` state to the slot state machine (between DQBUF and Export/DeriveImage). Without this, vaapi-copy races Fix 3. Concretely: slot state after DQBUF in SyncSurface should be `CAP_SLOT_DECODED`, not FREE. It transitions to FREE when BeginPicture recycles it for a new decode. It transitions to EXPORTED when ExportSurfaceHandle is called. DeriveImage reads the mmap while in DECODED state (safe). A concurrent BeginPicture for a DIFFERENT surface must not claim a DECODED slot — the LRU search should only consider FREE slots, not DECODED ones.
2. Add `pthread_mutex_t` to protect cap_pool slot state. Export and BeginPicture may be called from different threads.
3. Clarify the sentinel write patch at EndPicture to use `slot->map[0]` not `surface_object->destination_map[0]`.
**SHOULD address in Phase 6 but not blockers**:
4. REQBUFS(0) on CAPTURE in the resolution-change path at surface.c line ~121 — audit whether this is currently missing and add it to Fix 1 scope if so.
5. Conditional modifier based on pitch alignment (not universal MOD_INVALID) — safer than universal change.
6. Document the `our_export_fd` double-Export behavior.
**Known gaps, defer to iteration 3**:
7. `LAST_OUTPUT_WIDTH/HEIGHT` thread safety (static globals, no mutex, multi-context race).
8. Full Option B (V4L2_MEMORY_DMABUF mode).
9. multi-context concurrent use (two contexts open simultaneously).
---
### Critical Files for Implementation
- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/surface.c` (Fix 1 invalidation in DestroyContext path; Fix 2 modifier change in ExportSurfaceHandle; Fix 3 pool allocation and slot management throughout CreateSurfaces2, DestroySurfaces, ExportSurfaceHandle, SyncSurface)
- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/surface.h` (new `cap_pool_slot` struct and state enum; changes to `object_surface` to remove now-transient fields or add slot pointer)
- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/picture.c` (Fix 3: BeginPicture slot acquisition and cleanup; EndPicture QBUF must use slot index; sentinel write must use slot mmap)
- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/image.c` (vaapi-copy path `copy_surface_to_image` reads `destination_data[i]` which must remain valid through DeriveImage in DECODED state; may need to access slot mmap via surface→current_slot pointer)
- `/home/mfritsche/src/libva-multiplanar/libva-v4l2-request-fourier/src/context.c` (Fix 1: `RequestDestroyContext` must zero `LAST_OUTPUT_WIDTH/HEIGHT`; also the callsite that issues REQBUFS(0) on both queues — verify CAPTURE REQBUFS(0) is present in resolution-change path too)